    Towards directly modeling raw speech signal for speaker verification using CNNs

    Speaker verification systems traditionally extract and model cepstral features or filter bank energies from the speech signal. In this paper, inspired by the success of neural network-based approaches that directly model the raw speech signal for applications such as speech recognition, emotion recognition and anti-spoofing, we propose a speaker verification approach where speaker discriminative information is learned directly from the speech signal by: (a) first training a CNN-based speaker identification system that takes the raw speech signal as input and learns to classify speakers (unknown to the speaker verification system); and then (b) building a speaker detector for each speaker in the speaker verification system by replacing the output layer of the speaker identification system with two outputs (genuine, impostor) and adapting the system discriminatively with enrollment speech of the speaker and impostor speech data. Our investigations on the Voxforge database show that this approach can yield systems competitive with state-of-the-art systems. An analysis of the filters in the first convolution layer shows that the filters emphasize information in low frequency regions (below 1000 Hz) and implicitly learn to model fundamental frequency information in the speech signal for speaker discrimination.
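
    A minimal PyTorch sketch of the two-stage recipe above; the layer sizes, kernel widths and names are illustrative assumptions rather than the paper's exact configuration:

        import torch
        import torch.nn as nn

        class RawSpeechCNN(nn.Module):
            """CNN mapping a raw waveform to class posterior scores."""
            def __init__(self, n_classes):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv1d(1, 32, kernel_size=300, stride=100),  # acts as a learned filterbank
                    nn.ReLU(),
                    nn.Conv1d(32, 64, kernel_size=5, stride=2),
                    nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1),
                )
                self.output = nn.Linear(64, n_classes)

            def forward(self, x):  # x: (batch, 1, n_samples)
                return self.output(self.features(x).squeeze(-1))

        # Stage (a): speaker identification on speakers unknown to the
        # verification system, trained with cross-entropy.
        model = RawSpeechCNN(n_classes=300)

        # Stage (b): per enrolled speaker, replace the output layer with two
        # outputs (genuine, impostor) and adapt the network discriminatively
        # on the speaker's enrollment speech plus impostor speech data.
        model.output = nn.Linear(64, 2)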

    Long Term Spectral Statistics for Voice Presentation Attack Detection

    Automatic speaker verification systems can be spoofed through recorded, synthetic or voice-converted speech of target speakers. To make these systems practically viable, the detection of such attacks, referred to as presentation attacks, is of paramount interest. In that direction, this paper investigates two aspects: (a) a novel approach to detect presentation attacks where, unlike conventional approaches, no assumptions are made about the speech signal; instead, the attacks are detected by computing first-order and second-order spectral statistics and feeding them to a classifier, and (b) generalization of presentation attack detection systems across databases. Our investigations on the Interspeech 2015 ASVspoof challenge dataset and the AVspoof dataset show that, compared to approaches based on conventional short-term spectral processing, the proposed approach with a linear discriminative classifier yields a better system, irrespective of whether the spoofed signal is replayed to the microphone or injected directly into the system software process. Cross-database investigations show that neither the short-term spectral processing-based approaches nor the proposed approach yield systems that generalize across databases or methods of attack, revealing the difficulty of the problem and the need for further resources and research.
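
    A sketch of this detector, assuming the first- and second-order statistics are the per-frequency-bin mean and standard deviation of the log-magnitude spectrum over the utterance; the frame length and classifier setup are illustrative:

        import numpy as np
        from scipy.signal import stft
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        def long_term_spectral_statistics(signal, fs, nperseg=512):
            """First- and second-order statistics of the log spectrum over time."""
            _, _, spec = stft(signal, fs=fs, nperseg=nperseg)
            log_mag = np.log(np.abs(spec) + 1e-10)          # (freq_bins, frames)
            return np.concatenate([log_mag.mean(axis=1),    # first-order statistic
                                   log_mag.std(axis=1)])    # second-order statistic

        # With labels y (1 = bona fide, 0 = presentation attack):
        # X = np.stack([long_term_spectral_statistics(s, 16000) for s in signals])
        # clf = LinearDiscriminantAnalysis().fit(X, y)
        # scores = clf.decision_function(X_test)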

    AudioPaLM: A Large Language Model That Can Speak and Listen

    We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech, with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits from AudioLM the capability to preserve paralinguistic information such as speaker identity and intonation, and from PaLM-2 the linguistic knowledge present only in text-based large language models. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems on speech translation tasks and can perform zero-shot speech-to-text translation for many languages whose input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
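
    The abstract describes a single model whose vocabulary covers both text and audio tokens, initialized from a text-only model; a hedged sketch of that initialization step, with all sizes and names as illustrative assumptions:

        import torch
        import torch.nn as nn

        text_vocab, audio_vocab, d_model = 32000, 1024, 512  # illustrative sizes

        # Embedding table of a pretrained text LLM (stand-in for PaLM-2 weights).
        text_embeddings = nn.Embedding(text_vocab, d_model)

        # Unified table: text rows are copied from the text model; rows for
        # audio tokens (e.g. AudioLM-style tokens) are freshly initialized.
        unified = nn.Embedding(text_vocab + audio_vocab, d_model)
        with torch.no_grad():
            unified.weight[:text_vocab] = text_embeddings.weight

        # A decoder over this extended vocabulary can then read and write
        # mixed sequences of text and audio tokens in a single model.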

    Overview of BTAS 2016 Speaker Anti-spoofing Competition

    This paper provides an overview of the Speaker Anti-spoofing Competition organized by the biometrics group at Idiap Research Institute for the IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2016). The competition used the AVspoof database, which contains a comprehensive set of presentation attacks, including: (i) direct replay attacks, in which genuine data is played back using a laptop and two phones (a Samsung Galaxy S4 and an iPhone 3G); (ii) synthesized speech replayed with a laptop; and (iii) speech created with a voice conversion algorithm, also replayed with a laptop. The paper states the competition goals, describes the database and the evaluation protocol, discusses the solutions for spoofing or presentation attack detection submitted by the participants, and presents the results of the evaluation.

    Trustworthy speaker recognition with minimal prior knowledge using neural networks

    The performance of speaker recognition systems has improved considerably in the last decade. This is mainly due to the development of Gaussian mixture model-based systems and in particular to the use of i-vectors. These systems handle noise and channel mismatches relatively well and yield a low error rate when confronted with zero-effort impostors, i.e. impostors using their own voice but claiming to be someone else. However, speaker verification systems are vulnerable to more sophisticated attacks, called presentation or spoofing attacks. In that case, the impostor presents a fake sample to the system, which can either be generated with a speech synthesis or voice conversion algorithm or be a previous recording of the target speaker. One way to make speaker recognition systems robust to this type of attack is to integrate a presentation attack detection system. Current methods for speaker recognition and presentation attack detection are largely based on short-term spectral processing, which has certain limitations. For instance, state-of-the-art speaker verification systems use cepstral features, which mainly capture vocal tract system characteristics, although voice source characteristics are also speaker discriminative. In the case of presentation attack detection, there is little prior knowledge to guide us in differentiating bona fide samples from presentation attacks, as both are speech signals that carry the same high-level information, such as the message, speaker identity and information about the environment. This thesis focuses on developing speaker verification and presentation attack detection systems that rely on minimal assumptions. Towards that, inspired by recent advances in deep learning, we first develop speaker verification approaches where speaker discriminative information is learned from raw waveforms using convolutional neural networks (CNNs). We show that such approaches are capable of learning both voice source-related and vocal tract system-related speaker discriminative information and yield performance competitive with state-of-the-art systems, namely i-vector- and x-vector-based systems. We then develop two high-performing approaches for presentation attack detection: one based on long-term spectral statistics and the other based on raw speech modeling using CNNs. We show that these two approaches are complementary and make speaker verification systems robust to presentation attacks. Finally, we develop a visualization method inspired by the computer vision community to gain insight into the task-specific information captured by the CNNs from raw speech signals.

    End-to-End Convolutional Neural Network-based Voice Presentation Attack Detection

    Development of countermeasures to detect attacks performed on speaker verification systems through the presentation of forged or altered speech samples is a challenging and open research problem. Typically, this problem is approached by extracting features through conventional short-term speech processing and feeding them to a binary classifier. In this article, we develop a convolutional neural network-based approach that learns both the features and the binary classifier from the raw signal in an end-to-end manner. Through investigations on two publicly available databases, namely ASVspoof and AVspoof, we show that the proposed approach yields systems comparable to or better than state-of-the-art approaches for both physical access attacks and logical access attacks. Furthermore, the approach is shown to be complementary to a spectral statistics-based approach, which, like the proposed approach, makes no prior assumptions about the speech signal.
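
    Since the two detectors are reported to be complementary, one simple way to combine them is score-level fusion; a sketch, with the normalization scheme and fusion weight as illustrative assumptions:

        import numpy as np

        def fuse_scores(cnn_scores, ltss_scores, weight=0.5):
            """Weighted sum of z-normalized detector scores (higher = bona fide)."""
            def znorm(s):
                return (s - s.mean()) / (s.std() + 1e-10)
            return weight * znorm(cnn_scores) + (1.0 - weight) * znorm(ltss_scores)

        # decisions = fuse_scores(cnn_scores, ltss_scores) > threshold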

    On Learning to Identify Genders from Raw Speech Signal Using CNNs

    Automatic Gender Recognition (AGR) is the task of identifying the gender of a speaker given a speech signal. Standard approaches extract features such as fundamental frequency and cepstral features from the speech signal and train a binary classifier. Inspired by recent work in automatic speech recognition (ASR), speaker recognition and presentation attack detection, we present a novel approach where the relevant features and the classifier are jointly learned from the raw speech signal in an end-to-end manner. We propose a convolutional neural network (CNN)-based gender classifier that consists of: (1) convolution layers, which can be interpreted as a feature learning stage, and (2) a multilayer perceptron (MLP), which can be interpreted as a classification stage. The system takes the raw speech signal as input and outputs gender posterior probabilities. Experimental studies conducted on two datasets, namely AVspoof and ASVspoof 2015, with different architectures show that, even with simple architectures, the proposed approach yields a better system than the standard approach based on acoustic features. Further analysis of the CNNs shows that they learn formant and fundamental frequency information for gender identification.
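
    The kind of filter analysis mentioned above can be approximated by summing the magnitude responses of the learned first-layer kernels; a sketch assuming a Conv1d first layer as in the models above (names are illustrative):

        import numpy as np

        def cumulative_filter_response(conv_weights, fs=16000, n_fft=1024):
            """Summed magnitude response of first-layer Conv1d kernels.

            conv_weights: array of shape (n_filters, 1, kernel_size), e.g.
            model.features[0].weight.detach().numpy().
            """
            kernels = conv_weights[:, 0, :]                    # (n_filters, kernel_size)
            responses = np.abs(np.fft.rfft(kernels, n=n_fft))  # per-filter responses
            freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
            return freqs, responses.sum(axis=0)

        # Peaks in the cumulative response indicate the frequency regions the
        # CNN emphasizes, e.g. fundamental frequency and formant ranges.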

    Understanding and Visualizing Raw Waveform-based CNNs

    Directly modeling raw waveforms with neural networks for speech processing is gaining more and more attention. Despite its varied success, a question that remains is: what kind of information do such neural networks capture or learn from the speech signal for different tasks? Such an insight is not only interesting for advancing those techniques but also for better understanding speech signal characteristics. This paper takes a step in that direction: we develop a gradient-based approach to estimate the relevance of each input speech sample to the output score. We show that analysis of the resulting "relevance signal" through conventional speech signal processing techniques can reveal the information modeled by the whole network. We demonstrate the potential of the proposed approach by analyzing raw waveform CNN-based phone recognition and speaker identification systems.
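
    A minimal sketch of the gradient-based relevance estimate described above, assuming a PyTorch model that maps a raw waveform to per-class output scores:

        import torch

        def relevance_signal(model, waveform, class_idx):
            """Gradient of one output score with respect to each input sample."""
            x = waveform.clone().requires_grad_(True)  # shape (1, 1, n_samples)
            model(x)[0, class_idx].backward()
            # The gradient has the same length as the input; analyze it with
            # conventional speech signal processing (e.g. spectral analysis).
            return x.grad[0, 0]

        # rel = relevance_signal(model, waveform, class_idx=speaker_id)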